Introduction to Triton Programming: The Performance Paradox: Why Correct Code Is Slow

The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can actually run slower than a CPU loop if it fails to amortize the fixed costs of the GPU hardware. This most often shows up as the Launch Tax: the fixed cost paid on every kernel launch, regardless of how much work the kernel does.

1. The "Correctness" Fallacy

Functional correctness is not an indicator of efficiency. Even if your Triton code correctly distributes the work across thousands of threads, the GPU stays underutilized when the total workload (N) is small. The hardware spends more time in state transitions than in actually executing arithmetic operations.
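To make the underutilization concrete, the sketch below estimates how many streaming multiprocessors (SMs) a 1D launch can keep busy. It is a rough model, not a measurement: the block size of 1024, the SM count of 108, and the helper name `sm_utilization_estimate` are illustrative assumptions, and the model pretends each SM runs at most one program instance at a time.

```python
def cdiv(a, b):
    """Ceiling division, the usual way to size a 1D launch grid."""
    return (a + b - 1) // b


def sm_utilization_estimate(n, block_size=1024, num_sms=108):
    """Return (program instances, fraction of SMs with work).

    block_size and num_sms are assumed values for illustration only;
    real hardware can also run several programs per SM concurrently.
    """
    programs = cdiv(n, block_size)
    busy_sms = min(programs, num_sms)
    return programs, busy_sms / num_sms


for n in (256, 10**8):
    programs, fraction = sm_utilization_estimate(n)
    print(f"N={n}: {programs} program(s), {fraction:.1%} of SMs busy")
```

Under these assumptions, N=256 fits in a single program instance, so almost the whole chip sits idle no matter how correct the kernel is.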

2. The Python Measurement Trap

Benchmarking GPU code from Python with time.time() is dangerous. GPU calls are asynchronous; Python simply enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the enqueue time. With synchronization, you measure the host-to-device launch latency together with the kernel execution, and for small workloads that latency is often around 10 times larger than the execution of the kernel itself.
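The difference between the two measurements can be sketched with a small timing helper. The helper name `timed_ms`, the repeat count, and the use of `time.perf_counter()` in place of `time.time()` are choices made for this illustration; the GPU part runs only if PyTorch and a CUDA device are present.

```python
import time


def timed_ms(launch, synchronize=None, repeats=100):
    """Average wall-clock milliseconds per call of `launch`.

    If `synchronize` is given, it is called before starting and before
    stopping the clock, so queued asynchronous work is included.
    """
    if synchronize is not None:
        synchronize()  # drain any earlier work so it is not billed to us
    start = time.perf_counter()
    for _ in range(repeats):
        launch()
    if synchronize is not None:
        synchronize()  # wait until the queued work has actually finished
    return (time.perf_counter() - start) * 1e3 / repeats


try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    x = torch.rand(256, device="cuda")
    y = torch.rand(256, device="cuda")

    def add():
        return x + y

    add()  # warm-up call, so one-time setup cost is not measured
    print("enqueue only (ms):", timed_ms(add))
    print("synchronized (ms):", timed_ms(add, torch.cuda.synchronize))
```

The first number reflects only how quickly Python can enqueue work; the second includes waiting for the GPU, which is the figure that matters when comparing kernels.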

3. Latency vs. Throughput

To overcome the paradox, you must supply enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the fixed cost of crossing the CPU-GPU boundary) to a throughput-bound regime (limited by GPU memory bandwidth or compute).
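A back-of-the-envelope model shows where that transition happens for vector addition. The constants below (10 microseconds of launch overhead, 1 TB/s of memory bandwidth, float32 elements) are assumed round numbers, not measurements of any particular GPU; the only input taken from the kernel itself is its three memory accesses per element.

```python
LAUNCH_OVERHEAD_S = 10e-6    # assumed fixed cost per launch
PEAK_BANDWIDTH_BPS = 1e12    # assumed memory bandwidth, bytes per second
BYTES_PER_ELEMENT = 3 * 4    # 2 loads + 1 store, 4 bytes each (float32)


def modeled_time_s(n):
    """Modeled runtime: fixed launch cost plus memory traffic time."""
    return LAUNCH_OVERHEAD_S + n * BYTES_PER_ELEMENT / PEAK_BANDWIDTH_BPS


def overhead_share(n):
    """Fraction of the modeled runtime spent on launch overhead."""
    return LAUNCH_OVERHEAD_S / modeled_time_s(n)


def break_even_n():
    """N at which memory traffic time equals the launch overhead."""
    return LAUNCH_OVERHEAD_S * PEAK_BANDWIDTH_BPS / BYTES_PER_ELEMENT


for n in (256, 10**6, 10**8):
    print(f"N={n}: {overhead_share(n):.2%} of modeled time is launch overhead")
print(f"break-even N is roughly {break_even_n():,.0f} elements")
```

Well below the break-even point the kernel is latency-bound and a larger N costs almost nothing extra; well above it the runtime scales with N and the kernel is throughput-bound.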

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.